We have set out to predict the total cost for two people to stay in an Airbnb in the city of Milan for 4 nights. To select properties that are suitable we have ensured that all have private rooms, a rating of 4.5 or greater and have at least 10 reviews.
To undertake the analysis, we conducted a thorough exploration of the data to gain an understanding of the relevant variables. During our EDA we found that a number of variables affect the price of a room in Milan. The number and type of services provided with the rental are positively associated with the price of the stay, likely because of the costs associated with these services. The type of property has a large impact on the price of the stay, with hotel rooms and entire lofts commanding the largest premiums. Neighborhoods also play a large role in the price of the Airbnb, allowing hosts to command higher prices because of their location: Tre Torri, an affluent modern neighborhood, has on average the highest prices per room.
By building 8 models we were able to demonstrate and understand how a number of different variables affect the price of our desired stay. To do this we first chose between room type and property type (in a simplified version). We then considered 4 size variables: bathrooms, bedrooms, beds and accommodates (the number of people the property can host). From these we selected bedrooms for our regression analysis. We found that the best-fitting model was model 8, with an R-squared of 0.304, the highest of any model we fitted. Using this model we were able to find c.1100 properties in Milan that are suitable for 2 people staying 4 nights. We have also been able to illustrate the distribution of prices across these suitable properties.
The following report walks you through our process, exploration, analysis and outputs.
# Exploratory Data Analysis for Airbnb properties in Milan
## Let’s look at the raw data
glimpse(listings)
Rows: 17,703
Columns: 74
$ id <dbl> 6400, 23986, 28300, 37256~
$ listing_url <chr> "https://www.airbnb.com/r~
$ scrape_id <dbl> 2.021092e+13, 2.021092e+1~
$ last_scraped <date> 2021-09-20, 2021-09-20, ~
$ name <chr> "The Studio Milan", "\" C~
$ description <chr> "Enjoy your stay at The S~
$ neighborhood_overview <chr> "The neighborhood is quie~
$ picture_url <chr> "https://a0.muscache.com/~
$ host_id <dbl> 13822, 95941, 121663, 119~
$ host_url <chr> "https://www.airbnb.com/u~
$ host_name <chr> "Francesca", "Jeremy", "M~
$ host_since <date> 2009-04-17, 2010-03-19, ~
$ host_location <chr> "Milan, Lombardia, Italy"~
$ host_about <chr> "I'm am Francesca Sottila~
$ host_response_time <chr> "N/A", "N/A", "N/A", "N/A~
$ host_response_rate <chr> "N/A", "N/A", "N/A", "N/A~
$ host_acceptance_rate <chr> "N/A", "N/A", "N/A", "N/A~
$ host_is_superhost <lgl> FALSE, FALSE, FALSE, TRUE~
$ host_thumbnail_url <chr> "https://a0.muscache.com/~
$ host_picture_url <chr> "https://a0.muscache.com/~
$ host_neighbourhood <chr> "Zona 5", "Navigli", "Cen~
$ host_listings_count <dbl> 1, 1, 1, 2, 2, 2, 4, 1, 0~
$ host_total_listings_count <dbl> 1, 1, 1, 2, 2, 2, 4, 1, 0~
$ host_verifications <chr> "['email', 'phone', 'revi~
$ host_has_profile_pic <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ host_identity_verified <lgl> FALSE, TRUE, TRUE, TRUE, ~
$ neighbourhood <chr> "Milan, Lombardy, Italy",~
$ neighbourhood_cleansed <chr> "TIBALDI", "NAVIGLI", "SA~
$ neighbourhood_group_cleansed <lgl> NA, NA, NA, NA, NA, NA, N~
$ latitude <dbl> 45.44195, 45.44991, 45.47~
$ longitude <dbl> 9.17797, 9.17597, 9.17359~
$ property_type <chr> "Private room in rental u~
$ room_type <chr> "Private room", "Entire h~
$ accommodates <dbl> 1, 4, 2, 1, 4, 4, 5, 3, 2~
$ bathrooms <lgl> NA, NA, NA, NA, NA, NA, N~
$ bathrooms_text <chr> "3.5 baths", "1 bath", "1~
$ bedrooms <dbl> 3, 1, 1, 1, 2, 2, 2, 2, 1~
$ beds <dbl> 1, 1, 2, 1, 4, 2, 3, 1, 1~
$ amenities <chr> "[\"Hangers\", \"Iron\", ~
$ price <chr> "$100.00", "$150.00", "$1~
$ minimum_nights <dbl> 4, 1, 1, 2, 3, 2, 2, 3, 2~
$ maximum_nights <dbl> 5, 730, 14, 730, 90, 30, ~
$ minimum_minimum_nights <dbl> 4, 1, 1, 2, 3, 2, 2, 3, 2~
$ maximum_minimum_nights <dbl> 4, 1, 1, 2, 3, 2, 2, 3, 2~
$ minimum_maximum_nights <dbl> 5, 730, 14, 1125, 90, 30,~
$ maximum_maximum_nights <dbl> 5, 730, 14, 1125, 90, 30,~
$ minimum_nights_avg_ntm <dbl> 4, 1, 1, 2, 3, 2, 2, 3, 2~
$ maximum_nights_avg_ntm <dbl> 5, 730, 14, 1125, 90, 30,~
$ calendar_updated <lgl> NA, NA, NA, NA, NA, NA, N~
$ has_availability <lgl> TRUE, TRUE, TRUE, TRUE, T~
$ availability_30 <dbl> 23, 28, 30, 0, 0, 23, 0, ~
$ availability_60 <dbl> 53, 58, 60, 0, 0, 53, 0, ~
$ availability_90 <dbl> 83, 88, 90, 0, 0, 83, 0, ~
$ availability_365 <dbl> 358, 363, 365, 0, 203, 35~
$ calendar_last_scraped <date> 2021-09-20, 2021-09-20, ~
$ number_of_reviews <dbl> 12, 15, 8, 34, 37, 14, 27~
$ number_of_reviews_ltm <dbl> 0, 0, 0, 0, 0, 0, 1, 0, 0~
$ number_of_reviews_l30d <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ first_review <date> 2014-04-11, 2015-09-21, ~
$ last_review <date> 2010-04-19, 2020-09-07, ~
$ review_scores_rating <dbl> 4.89, 4.64, 4.71, 4.90, 4~
$ review_scores_accuracy <dbl> 5.00, 4.53, 4.71, 4.79, 4~
$ review_scores_cleanliness <dbl> 5.00, 4.40, 4.86, 4.90, 4~
$ review_scores_checkin <dbl> 5.00, 4.40, 4.86, 5.00, 5~
$ review_scores_communication <dbl> 5.00, 4.53, 4.86, 5.00, 4~
$ review_scores_location <dbl> 4.56, 4.53, 5.00, 5.00, 4~
$ review_scores_value <dbl> 4.67, 4.53, 5.00, 4.59, 4~
$ license <chr> NA, NA, NA, NA, NA, NA, N~
$ instant_bookable <lgl> FALSE, FALSE, FALSE, TRUE~
$ calculated_host_listings_count <dbl> 1, 1, 1, 2, 2, 1, 1, 1, 1~
$ calculated_host_listings_count_entire_homes <dbl> 0, 1, 0, 1, 2, 1, 1, 1, 1~
$ calculated_host_listings_count_private_rooms <dbl> 1, 0, 1, 1, 0, 0, 0, 0, 0~
$ calculated_host_listings_count_shared_rooms <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0~
$ reviews_per_month <dbl> 0.13, 0.21, 0.11, 0.47, 0~
listings <- listings %>%
  mutate(price = parse_number(as.character(price)))
favstats(price ~ bedrooms, data=listings)

| bedrooms | min | Q1 | median | Q3 | max | mean | sd | n | missing |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 9 | 50 | 75 | 114 | 1.2e+04 | 107 | 247 | 12984 | 0 |
| 2 | 12 | 80 | 120 | 195 | 1e+04 | 177 | 370 | 2719 | 0 |
| 3 | 16 | 116 | 190 | 300 | 1e+04 | 301 | 713 | 463 | 0 |
| 4 | 17 | 146 | 247 | 400 | 2.5e+03 | 335 | 343 | 113 | 0 |
| 5 | 23 | 115 | 300 | 482 | 1.2e+03 | 333 | 282 | 24 | 0 |
| 6 | 90 | 191 | 292 | 392 | 493 | 292 | 285 | 2 | 0 |
| 7 | 394 | 1.17e+03 | 1.95e+03 | 2.72e+03 | 3.5e+03 | 1.95e+03 | 2.2e+03 | 2 | 0 |
| 8 | 732 | 732 | 732 | 732 | 732 | 732 | NA | 1 | 0 |
| 16 | 8.5e+03 | 8.5e+03 | 8.5e+03 | 8.5e+03 | 8.5e+03 | 8.5e+03 | NA | 1 | 0 |
There are 74 variables and 17,703 observations within the Airbnb dataset.
The following variables are numeric.
#Returning indicator names with type dbl
listings %>%
  select(where(is.numeric)) %>%
  colnames()
[1] "id"
[2] "scrape_id"
[3] "host_id"
[4] "host_listings_count"
[5] "host_total_listings_count"
[6] "latitude"
[7] "longitude"
[8] "accommodates"
[9] "bedrooms"
[10] "beds"
[11] "price"
[12] "minimum_nights"
[13] "maximum_nights"
[14] "minimum_minimum_nights"
[15] "maximum_minimum_nights"
[16] "minimum_maximum_nights"
[17] "maximum_maximum_nights"
[18] "minimum_nights_avg_ntm"
[19] "maximum_nights_avg_ntm"
[20] "availability_30"
[21] "availability_60"
[22] "availability_90"
[23] "availability_365"
[24] "number_of_reviews"
[25] "number_of_reviews_ltm"
[26] "number_of_reviews_l30d"
[27] "review_scores_rating"
[28] "review_scores_accuracy"
[29] "review_scores_cleanliness"
[30] "review_scores_checkin"
[31] "review_scores_communication"
[32] "review_scores_location"
[33] "review_scores_value"
[34] "calculated_host_listings_count"
[35] "calculated_host_listings_count_entire_homes"
[36] "calculated_host_listings_count_private_rooms"
[37] "calculated_host_listings_count_shared_rooms"
[38] "reviews_per_month"
The following variables are stored as character strings; several of them are categorical.
#Returning indicator names with type character
listings %>%
  select(where(is.character)) %>%
  colnames()
[1] "listing_url" "name" "description"
[4] "neighborhood_overview" "picture_url" "host_url"
[7] "host_name" "host_location" "host_about"
[10] "host_response_time" "host_response_rate" "host_acceptance_rate"
[13] "host_thumbnail_url" "host_picture_url" "host_neighbourhood"
[16] "host_verifications" "neighbourhood" "neighbourhood_cleansed"
[19] "property_type" "room_type" "bathrooms_text"
[22] "amenities" "license"
## Let’s better understand the data with the use of some graphs
### Here we have a bar chart of the number of bedrooms. We remove very large values, which in our case means properties that have more than 5 bedrooms.
listings %>%
  filter(bedrooms<=5) %>%
  ggplot(aes(x=bedrooms))+
  geom_bar()+
  labs(title="Number of Airbnb properties in Milan grouped by bedrooms", x="Bedrooms",y="Number of properties")+
  NULL
### Here we have a histogram to understand the distribution of the average review scores for properties in Milan. As we can see from the graph, the vast majority of properties have ratings above 4.
listings %>%
  ggplot(aes(x=review_scores_rating))+
  geom_histogram()+
  labs(title="Distribution of ratings per Airbnb property in Milan", x="Ratings",y="Number of properties")+
  NULL
### Here we have a box plot to understand the distribution of the number of reviews per Airbnb property. We filter the data and only analyze properties that have at least 100 reviews, to remove properties that have not been on the “market” long enough and hence have not been used much.
listings %>%
  filter(number_of_reviews>=100) %>%
  ggplot(aes(x=number_of_reviews))+
  geom_boxplot()+
  labs(title="Boxplot of the number of reviews per Airbnb property in Milan", x="Number of Reviews")+
  NULL
### Here we have a density plot to understand the distribution of price per Airbnb property. We filter the data and only analyze properties that have a price per night of at most 300, to remove the outliers created by properties that can be considered “luxury”.
listings <- listings %>% #Changing price from str to numeric data type
  mutate(price = parse_number(as.character(price))) %>%
  mutate(neighbourhood_simplified = ifelse(longitude <= 9.17279 & latitude <= 45.462395, "Southwest",
                                    ifelse(longitude <= 9.17279 & latitude > 45.462395, "Northwest",
                                    ifelse(longitude > 9.17279 & latitude <= 45.462395, "Southeast", "Northeast"))))
listings %>%
  filter(price<=300) %>%
  ggplot(aes(x=price))+
  geom_density()+
  labs(title="Distribution of the price per night per Airbnb property in Milan", x="Price per night",y="Density")+
  NULL
## Let’s look at the property types in more detail. Here are some numbers:
proportion_listing <- listings %>%
  group_by(property_type) %>%
  count() %>%
  mutate(pct = scales::percent(n / 17703))
proportion_listing %>%
  arrange(desc(n))
# A tibble: 52 x 3
# Groups: property_type [52]
property_type n pct
<chr> <int> <chr>
1 Entire rental unit 10178 57%
2 Private room in rental unit 2661 15%
3 Entire condominium (condo) 1830 10%
4 Entire loft 833 5%
5 Private room in condominium (condo) 631 4%
6 Entire residential home 274 2%
7 Entire serviced apartment 189 1%
8 Shared room in rental unit 182 1%
9 Private room in residential home 168 1%
10 Private room in bed and breakfast 137 1%
# ... with 42 more rows
The 4 most common property types are ‘entire rental unit’, ‘private room in rental unit’, ‘entire condo’ and ‘entire loft’. Together these property types make up roughly 88% of the properties (15,502 of 17,703 listings; about 57%, 15%, 10% and 5% respectively).
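As a quick check on the combined share, the counts from the table above can be summed directly. This is a minimal base R sketch; the names `top4_counts` and `total_listings` are our own, and the total of 17,703 is the row count reported by glimpse().

```r
# Counts of the four most common property types, copied from the table above
top4_counts <- c(
  "Entire rental unit"          = 10178,
  "Private room in rental unit" = 2661,
  "Entire condominium (condo)"  = 1830,
  "Entire loft"                 = 833
)
total_listings <- 17703  # number of rows in the listings data

# Share of each type and the combined share, in percent
individual_share <- 100 * top4_counts / total_listings
combined_share   <- 100 * sum(top4_counts) / total_listings

print(round(individual_share, 1))
print(round(combined_share, 1))
```

Summing the unrounded counts avoids the small discrepancy that can arise from adding percentages that were each rounded first.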
Since the vast majority of the observations in the data are one of the top four or five property types, we have chosen to create a simplified version of property_type variable that has 5 categories: the top four categories and Other.
listings <- listings %>%
  mutate(prop_type_simplified = case_when(
    property_type %in% c("Entire rental unit","Private room in rental unit", "Entire condominium (condo)","Entire loft") ~ property_type,
    TRUE ~ "Other"
  ))
listings %>%
  count(property_type, prop_type_simplified) %>%
  arrange(desc(n))

| property_type | prop_type_simplified | n |
|---|---|---|
| Entire rental unit | Entire rental unit | 10178 |
| Private room in rental unit | Private room in rental unit | 2661 |
| Entire condominium (condo) | Entire condominium (condo) | 1830 |
| Entire loft | Entire loft | 833 |
| Private room in condominium (condo) | Other | 631 |
| Entire residential home | Other | 274 |
| Entire serviced apartment | Other | 189 |
| Shared room in rental unit | Other | 182 |
| Private room in residential home | Other | 168 |
| Private room in bed and breakfast | Other | 137 |
| Private room in loft | Other | 91 |
| Room in boutique hotel | Other | 56 |
| Room in hotel | Other | 56 |
| Private room in serviced apartment | Other | 36 |
| Shared room in condominium (condo) | Other | 33 |
| Private room in villa | Other | 30 |
| Tiny house | Other | 26 |
| Room in aparthotel | Other | 24 |
| Entire guest suite | Other | 22 |
| Private room in guest suite | Other | 21 |
| Private room in townhouse | Other | 20 |
| Entire villa | Other | 19 |
| Room in bed and breakfast | Other | 19 |
| Entire townhouse | Other | 18 |
| Private room in hostel | Other | 17 |
| Room in serviced apartment | Other | 17 |
| Shared room in hostel | Other | 17 |
| Room in hostel | Other | 12 |
| Private room | Other | 10 |
| Entire place | Other | 9 |
| Shared room in loft | Other | 9 |
| Private room in tiny house | Other | 8 |
| Shared room in residential home | Other | 8 |
| Private room in casa particular | Other | 5 |
| Shared room in bed and breakfast | Other | 5 |
| Private room in guesthouse | Other | 4 |
| Casa particular | Other | 3 |
| Entire bed and breakfast | Other | 3 |
| Entire guesthouse | Other | 3 |
| Entire home/apt | Other | 3 |
| Shared room in tiny house | Other | 3 |
| Camper/RV | Other | 2 |
| Dome house | Other | 2 |
| Boat | Other | 1 |
| Cave | Other | 1 |
| Earth house | Other | 1 |
| Island | Other | 1 |
| Private room in camper/rv | Other | 1 |
| Private room in cave | Other | 1 |
| Private room in earth house | Other | 1 |
| Private room in farm stay | Other | 1 |
| Tipi | Other | 1 |
We will now look at the correlation between selected variables in the dataset.
listings %>% #Correlation between availability and price
  select(where(is.numeric)) %>%
  select(price, availability_30,availability_60,availability_90,availability_365) %>%
  ggpairs(aes(alpha=0.2))+
  theme_bw()
As the graph shows, the correlation between availability and price is low. This suggests that the availability of a room has little linear association with its price.
listings %>% #Correlation between review and price
  select(price, bedrooms,beds,review_scores_rating,review_scores_accuracy, review_scores_cleanliness,review_scores_checkin,
         review_scores_communication,review_scores_location,review_scores_value ) %>%
  ggpairs(aes(alpha=0.2))+
  theme_bw()
As the graph shows, the correlation between ratings and price is also low. Lower-priced rooms can therefore still receive high ratings, which suggests that customers care about value for money. There is, however, a significant correlation between the number of beds and price.
listings %>%
  group_by(prop_type_simplified) %>%
  summarise(avg_price = mean(price)) %>%
  ggplot(aes(x = prop_type_simplified, y = avg_price)) +
  geom_col() +
  labs(title = "Average Property Price of Different Property Types",
       x = "Property Type",
       y = "Average Price Per Night")
The bar chart above shows that entire lofts have the highest average price of all the property types, while private rooms in rental units have the lowest. This makes sense, since lofts tend to have more modern furnishings than traditional buildings, especially in historic European cities like Milan. Lofts are also more spacious than other types, based on the personal experience of Francesco (our Italian group member). In addition, guests in a private room have to share the living room with other tenants, which reduces their comfort.
listings %>%
  group_by(room_type) %>%
  summarise(avg_price = mean(price)) %>%
  ggplot(aes(x = room_type, y = avg_price)) +
  geom_col() +
  labs(title = "Average Property Price of Different Room Types",
       x = "Room Type",
       y = "Average Price Per Night")
The bar chart above shows that hotel rooms have a much higher average price than any other room type, since customers pay a premium for cleaning, security, free breakfast, etc. In comparison, shared rooms have the lowest average price of all types, since the space has to be shared with someone else.
listings %>%
  group_by(neighbourhood_cleansed) %>%
  summarise(avg_price = mean(price)) %>%
  ggplot(aes(x = avg_price, y = neighbourhood_cleansed)) +
  geom_col() +
  labs(title = "Average Property Price of Different Neighbourhoods",
       x = "Average Price Per Night",
       y = "Neighbourhood")
Tre Torri has the highest average property price of all the neighbourhoods. Tre Torri is centred on the three towers, which serve a substantial number of employees working in high-caliber companies. The facilities in this area are extremely modern, with only 14 years of history since groundbreaking, and are accompanied by a lot of parks for entertainment. Ronchetto delle Rane, on the other hand, has the lowest average property price, since it is located in the suburbs of Milan and has outdated facilities.
correlation_matrix_data_1 <- listings %>%
  select(price, bedrooms, accommodates)
#Use pairwise complete observations because bedrooms contains missing values
corr <- round(cor(correlation_matrix_data_1, use = "pairwise.complete.obs"), 1)
ggcorrplot(corr)
#Changing price from str to numeric data type
listings <- listings %>%
  mutate(price = parse_number(as.character(price)))
typeof(listings$price)
[1] "double"
We have confirmed that price is now formatted as a number.
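To illustrate what this conversion does, here is a minimal base R sketch on a few made-up price strings. The report itself uses readr's parse_number(); the helper `price_to_number` below is a hypothetical stand-in written for illustration only.

```r
# Hypothetical helper: strip currency symbols and thousands separators,
# then convert the remaining text to a number
price_to_number <- function(x) {
  as.numeric(gsub("[$,]", "", x))
}

# Made-up example values in the same format as the price column
raw_prices <- c("$100.00", "$150.00", "$1,250.00")
clean_prices <- price_to_number(raw_prices)

print(clean_prices)
print(typeof(clean_prices))
```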
Airbnb is most commonly used for travel purposes, i.e., as an alternative to traditional hotels. We only want to include listings in our regression analysis that are intended for travel purposes:
The most commonly reported values of minimum_nights are 1, 2 and 3 nights.
nights_listing <- listings %>%
  group_by(minimum_nights) %>%
  count() %>%
  mutate(pct = scales::percent(n / 17703))
nights_listing %>%
  arrange(desc(n))
# A tibble: 69 x 3
# Groups: minimum_nights [69]
minimum_nights n pct
<dbl> <int> <chr>
1 1 6853 39%
2 2 5548 31%
3 3 2246 13%
4 4 653 4%
5 5 571 3%
6 7 459 3%
7 30 291 2%
8 6 164 1%
9 15 145 1%
10 29 122 1%
# ... with 59 more rows
One value of minimum nights that stands out is 30. A possible explanation is that these hosts prefer long-term lettings. Furthermore, Airbnb wants guests to stay longer; in that way the occupancy of the property increases, reducing the business risk. Another stand-out value is a minimum of 7 nights, which is more common than a minimum of 6 nights: encouraging guests to stay an entire week reduces hassle for the host.
We have filtered the data so that it shows the minimum nights as less than or equal to 4 nights.
listings_4nights <- listings %>%
  filter(minimum_nights <= 4)
#Check that the derived dataset only includes minimum_nights <= 4
listings_4nights %>%
  group_by(minimum_nights) %>%
  count()
# A tibble: 4 x 2
# Groups: minimum_nights [4]
minimum_nights n
<dbl> <int>
1 1 6853
2 2 5548
3 3 2246
4 4 653
listings %>%
  filter(minimum_nights <= 4) %>%
  ggplot(aes(x=minimum_nights))+
  geom_bar()+
  labs(title="Number of properties in Milan grouped by minimum nights",
       subtitle="We only consider properties that have 4 or fewer minimum nights",
       x="Minimum nights",
       y="Number of properties")+
  NULL
leaflet(data = filter(listings, minimum_nights <= 4)) %>%
  addProviderTiles("OpenStreetMap.Mapnik") %>%
  addCircleMarkers(lng = ~longitude,
                   lat = ~latitude,
                   radius = 1,
                   fillColor = "blue",
                   fillOpacity = 0.4,
                   popup = ~listing_url,
                   label = ~property_type)
We have created a new variable called ‘price_4_nights’ from ‘price’, after keeping only listings that can host at least two people (‘accommodates’ >= 2), to calculate the total cost for two people to stay at the Airbnb property for 4 nights.
listings_4_nights_2_people <- listings %>%
  filter(minimum_nights <= 4 , maximum_nights >= 4, accommodates >=2)
listings_4_nights_2_people <- listings_4_nights_2_people %>%
  mutate(price_4_nights = price*4)
We should use log-adjusted prices for the regression analysis, as the log-transformed variable is approximately normally distributed.
ggplot(data=listings_4_nights_2_people, aes(x= price_4_nights)) +
  geom_histogram() +
  scale_x_continuous(limits=c(0,1000)) +
  labs(title = 'Price distribution for accommodation in Milan for 4 nights and 2 people', x = "Price", y = "Count") +
  theme_bw()
ggplot(data=listings_4_nights_2_people, aes(x= log(price_4_nights))) +
  geom_histogram() +
  scale_x_continuous() +
  labs(title = 'Log adjusted price distribution for accommodation in Milan for 4 nights and 2 people', x = "log(Price)", y = "Count") +
  theme_bw()
Comment: We choose to use log(price_4_nights) for the regression, since the distribution is approximately normal after taking the log of price. This makes the model more consistent with the typical assumptions of OLS analysis.
On the other hand, the distribution of price_4_nights itself is right-skewed, which would distort the regression model (the coefficients would tend to be overestimated).
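The effect of the log transformation can be illustrated on simulated data. The sketch below does not use the Airbnb data: it draws right-skewed values from a log-normal distribution (an assumption made purely for illustration) and compares the sample skewness before and after taking logs. The `skewness` helper is our own.

```r
set.seed(123)  # make the simulated draw reproducible

# Simulated right-skewed "prices" (log-normal by construction, not real data)
sim_price <- rlnorm(5000, meanlog = 5.5, sdlog = 0.6)

# Sample skewness: mean of the cubed standardized values
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(sim_price)       # positive: long right tail
skew_log <- skewness(log(sim_price))  # near zero: roughly symmetric

print(c(raw = skew_raw, log = skew_log))
```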
We have created a regression model called model1 with the following explanatory variables: prop_type_simplified, number_of_reviews, and review_scores_rating.
log_listings_4_nights_2_people <- listings_4_nights_2_people %>% #Model 1 - Type of listing
  mutate(price_4_nights = log(price_4_nights))
model1 <- lm(price_4_nights ~
               prop_type_simplified +
               number_of_reviews +
               review_scores_rating,
             data = log_listings_4_nights_2_people)
log_listings_4_nights_2_people %>%
  group_by(prop_type_simplified) %>%
  summarise(count=n())

| prop_type_simplified | count |
|---|---|
| Entire condominium (condo) | 1475 |
| Entire loft | 717 |
| Entire rental unit | 8629 |
| Other | 1518 |
| Private room in rental unit | 1722 |
autoplot(model1)+ theme_bw()
get_regression_table(model1)

| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 6.01 | 0.039 | 153 | 0 | 5.94 | 6.09 |
| prop_type_simplified: Entire loft | 0.173 | 0.032 | 5.35 | 0 | 0.11 | 0.236 |
| prop_type_simplified: Entire rental unit | 0.094 | 0.021 | 4.47 | 0 | 0.053 | 0.135 |
| prop_type_simplified: Other | -0.093 | 0.027 | -3.43 | 0.001 | -0.146 | -0.04 |
| prop_type_simplified: Private room in rental unit | -0.389 | 0.026 | -15 | 0 | -0.44 | -0.339 |
| number_of_reviews | -0.001 | 0 | -14.4 | 0 | -0.001 | -0.001 |
| review_scores_rating | -0.029 | 0.007 | -3.89 | 0 | -0.043 | -0.014 |
get_regression_summaries(model1)

| r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
|---|---|---|---|---|---|---|---|---|
| 0.081 | 0.08 | 0.367 | 0.606 | 0.606 | 159 | 0 | 6 | 1.09e+04 |
mosaic::msummary(model1)
Estimate Std. Error t value
(Intercept) 6.013e+00 3.935e-02 152.820
prop_type_simplifiedEntire loft 1.731e-01 3.234e-02 5.351
prop_type_simplifiedEntire rental unit 9.364e-02 2.095e-02 4.470
prop_type_simplifiedOther -9.258e-02 2.700e-02 -3.428
prop_type_simplifiedPrivate room in rental unit -3.895e-01 2.589e-02 -15.046
number_of_reviews -1.192e-03 8.293e-05 -14.376
review_scores_rating -2.868e-02 7.372e-03 -3.891
Pr(>|t|)
(Intercept) < 2e-16 ***
prop_type_simplifiedEntire loft 8.94e-08 ***
prop_type_simplifiedEntire rental unit 7.90e-06 ***
prop_type_simplifiedOther 0.000609 ***
prop_type_simplifiedPrivate room in rental unit < 2e-16 ***
number_of_reviews < 2e-16 ***
review_scores_rating 0.000100 ***
Residual standard error: 0.606 on 10864 degrees of freedom
(3190 observations deleted due to missingness)
Multiple R-squared: 0.08094, Adjusted R-squared: 0.08043
F-statistic: 159.5 on 6 and 10864 DF, p-value: < 2.2e-16
car::vif(model1)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 1.006827 4 1.000851
number_of_reviews 1.013385 1 1.006670
review_scores_rating 1.009113 1 1.004546
Comment: review_scores_rating is negatively associated with the price, since its coefficient (and therefore its t-statistic) is negative. It is a statistically significant predictor of the price, as it has an absolute t-statistic of 3.891 (greater than 2, roughly the critical value at the 95% confidence level).
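The t-statistic quoted above is the coefficient estimate divided by its standard error, and the p-value follows from the t distribution with the residual degrees of freedom. A minimal sketch using the values printed in the model 1 summary (the variable names are ours):

```r
# Values for review_scores_rating copied from the model 1 summary
estimate  <- -2.868e-02
std_error <- 7.372e-03
df_resid  <- 10864  # residual degrees of freedom

# t-statistic: estimate divided by its standard error
t_stat <- estimate / std_error

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df_resid)

# Two-sided 5% critical value, for comparison with the rule of thumb of 2
t_crit <- qt(0.975, df = df_resid)

print(round(c(t = t_stat, p = p_value, critical = t_crit), 4))
```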
prop_type_simplified is statistically significant in predicting the price, since every property type (entire loft, entire rental unit, other, and private room in rental unit) has an absolute t-value greater than 2 relative to the baseline category, entire condominium (condo). According to their signs, entire lofts and entire rental units are associated with a higher price than the baseline, while private rooms in rental units and other types are associated with a lower price. Of all the property types, private room in rental unit has the largest effect on price, as shown by the size of its coefficient (-0.389, with a t-statistic of -15.046).
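Because the response is log(price_4_nights), each coefficient can be read as an approximate percentage effect: a coefficient b corresponds to a change of (exp(b) - 1) * 100 percent in price relative to the baseline category, holding the other variables fixed. The sketch below applies this to the model 1 estimates; the function name `pct_effect` is ours.

```r
# Convert a coefficient from a log-price model into a percentage change in price
pct_effect <- function(b) {
  (exp(b) - 1) * 100
}

# Property-type coefficients copied from the model 1 summary
# (baseline category: Entire condominium (condo))
coefs <- c(
  "Entire loft"                 = 0.1731,
  "Entire rental unit"          = 0.09364,
  "Other"                       = -0.09258,
  "Private room in rental unit" = -0.3895
)

print(round(pct_effect(coefs), 1))
```

On this reading, a private room in a rental unit is roughly a third cheaper than an entire condominium with the same number of reviews and rating, while an entire loft is a little under a fifth more expensive.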
We want to determine if room_type is a significant predictor of the cost for 4 nights, given everything else in the model. We have created a regression model called model2 that includes all of the explanatory variables in model1 plus room_type.
model2 <- lm(price_4_nights ~
               prop_type_simplified +
               number_of_reviews +
               review_scores_rating +
               room_type,
             data = log_listings_4_nights_2_people)
log_listings_4_nights_2_people %>%
  group_by(room_type) %>%
  summarise(count=n())

| room_type | count |
|---|---|
| Entire home/apt | 11335 |
| Hotel room | 61 |
| Private room | 2581 |
| Shared room | 84 |
autoplot(model2)+ theme_bw()
get_regression_table(model2)

| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 6.02 | 0.039 | 156 | 0 | 5.94 | 6.09 |
| prop_type_simplified: Entire loft | 0.172 | 0.032 | 5.44 | 0 | 0.11 | 0.234 |
| prop_type_simplified: Entire rental unit | 0.093 | 0.021 | 4.54 | 0 | 0.053 | 0.133 |
| prop_type_simplified: Other | 0.327 | 0.036 | 9.11 | 0 | 0.256 | 0.397 |
| prop_type_simplified: Private room in rental unit | 0.274 | 0.047 | 5.89 | 0 | 0.183 | 0.366 |
| number_of_reviews | -0.001 | 0 | -14.3 | 0 | -0.001 | -0.001 |
| review_scores_rating | -0.029 | 0.007 | -4.06 | 0 | -0.044 | -0.015 |
| room_type: Hotel room | 0.194 | 0.091 | 2.13 | 0.033 | 0.016 | 0.372 |
| room_type: Private room | -0.664 | 0.039 | -17 | 0 | -0.741 | -0.588 |
| room_type: Shared room | -1.26 | 0.083 | -15.2 | 0 | -1.42 | -1.1 |
get_regression_summaries(model2)

| r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
|---|---|---|---|---|---|---|---|---|
| 0.118 | 0.118 | 0.352 | 0.593 | 0.594 | 162 | 0 | 9 | 1.09e+04 |
mosaic::msummary(model2)
Estimate Std. Error t value
(Intercept) 6.016e+00 3.857e-02 155.950
prop_type_simplifiedEntire loft 1.724e-01 3.168e-02 5.441
prop_type_simplifiedEntire rental unit 9.317e-02 2.052e-02 4.541
prop_type_simplifiedOther 3.267e-01 3.585e-02 9.112
prop_type_simplifiedPrivate room in rental unit 2.744e-01 4.658e-02 5.891
number_of_reviews -1.160e-03 8.128e-05 -14.271
review_scores_rating -2.938e-02 7.228e-03 -4.064
room_typeHotel room 1.939e-01 9.089e-02 2.134
room_typePrivate room -6.641e-01 3.908e-02 -16.995
room_typeShared room -1.262e+00 8.304e-02 -15.201
Pr(>|t|)
(Intercept) < 2e-16 ***
prop_type_simplifiedEntire loft 5.41e-08 ***
prop_type_simplifiedEntire rental unit 5.67e-06 ***
prop_type_simplifiedOther < 2e-16 ***
prop_type_simplifiedPrivate room in rental unit 3.95e-09 ***
number_of_reviews < 2e-16 ***
review_scores_rating 4.85e-05 ***
room_typeHotel room 0.0329 *
room_typePrivate room < 2e-16 ***
room_typeShared room < 2e-16 ***
Residual standard error: 0.5936 on 10861 degrees of freedom
(3190 observations deleted due to missingness)
Multiple R-squared: 0.1183, Adjusted R-squared: 0.1176
F-statistic: 161.9 on 9 and 10861 DF, p-value: < 2.2e-16
car::vif(model2)
GVIF Df GVIF^(1/(2*Df))
prop_type_simplified 7.399788 4 1.284258
number_of_reviews 1.014343 1 1.007146
review_scores_rating 1.010966 1 1.005468
room_type 7.377840 3 1.395259
Comment: After running model 2, we found that all the room types (hotel room, private room and shared room) are statistically significant at the 5% level in explaining the variation in price, since their absolute t-statistics all lie above 2. More specifically, a hotel room is associated with a higher rental price, while private rooms and shared rooms have the opposite effect, for the reasons stated above in the EDA.
However, after adding the variable “room_type”, the coefficients of private room in rental unit and other property types changed from negative to positive. It is therefore reasonable to ask whether adding the new variable has affected the explanatory power of the original variable. The VIF output gives the answer: there is collinearity between prop_type_simplified and room_type, as their GVIF values (about 7.4 each) are greater than 5.
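For categorical predictors with more than one degree of freedom, car::vif() reports a generalized VIF (GVIF) together with GVIF^(1/(2*Df)), which adjusts for the number of dummy variables. A common rule of thumb, which we state here as an assumption rather than something taken from the analysis above, is to square the adjusted value to put it on the scale of an ordinary VIF. The sketch recomputes the adjusted values from the GVIFs printed above; the variable names are ours.

```r
# GVIF and degrees of freedom copied from the car::vif(model2) output
gvif <- c(prop_type_simplified = 7.399788, room_type = 7.377840)
dof  <- c(prop_type_simplified = 4,        room_type = 3)

# Adjusted GVIF, as shown in the last column of the vif() output
gvif_adj <- gvif^(1 / (2 * dof))

# Squared adjusted value, roughly comparable to an ordinary VIF
vif_like <- gvif_adj^2

print(round(cbind(gvif, dof, gvif_adj, vif_like), 3))
```

On the adjusted scale the values look less extreme, so the sign changes in the prop_type_simplified coefficients are useful supporting evidence that the two variables overlap.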
Knowing that they are collinear, we want to determine which one to keep for the rest of the analysis. Therefore, in model 2.2, we drop prop_type_simplified and compare the result with model 1.
model2.2 <- lm(price_4_nights ~
                 number_of_reviews +
                 review_scores_rating +
                 room_type,
               data = log_listings_4_nights_2_people)
autoplot(model2.2)+ theme_bw()
get_regression_table(model2.2)

| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 6.12 | 0.034 | 181 | 0 | 6.05 | 6.18 |
| number_of_reviews | -0.001 | 0 | -14.2 | 0 | -0.001 | -0.001 |
| review_scores_rating | -0.03 | 0.007 | -4.19 | 0 | -0.045 | -0.016 |
| room_type: Hotel room | 0.422 | 0.086 | 4.89 | 0 | 0.253 | 0.591 |
| room_type: Private room | -0.472 | 0.015 | -31.1 | 0 | -0.502 | -0.442 |
| room_type: Shared room | -1.03 | 0.078 | -13.3 | 0 | -1.19 | -0.881 |
get_regression_summaries(model2.2)

| r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
|---|---|---|---|---|---|---|---|---|
| 0.111 | 0.11 | 0.355 | 0.596 | 0.596 | 270 | 0 | 5 | 1.09e+04 |
mosaic::msummary(model2.2)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 6.119e+00 3.385e-02 180.731 < 2e-16 ***
number_of_reviews -1.154e-03 8.141e-05 -14.177 < 2e-16 ***
review_scores_rating -3.041e-02 7.256e-03 -4.191 2.80e-05 ***
room_typeHotel room 4.221e-01 8.633e-02 4.890 1.02e-06 ***
room_typePrivate room -4.719e-01 1.518e-02 -31.089 < 2e-16 ***
room_typeShared room -1.034e+00 7.794e-02 -13.269 < 2e-16 ***
Residual standard error: 0.5961 on 10865 degrees of freedom
(3190 observations deleted due to missingness)
Multiple R-squared: 0.1106, Adjusted R-squared: 0.1101
F-statistic: 270.1 on 5 and 10865 DF, p-value: < 2.2e-16
car::vif(model2.2)
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.009077 1 1.004528
review_scores_rating 1.010210 1 1.005092
room_type 1.003841 3 1.000639
Comment: After running model 2.2, we found that the explanatory power of room_type is stronger than that of prop_type_simplified, as the adjusted R-squared increased by roughly 0.03 (from 0.080 in model 1 to 0.110 in model 2.2). Therefore, we keep only room_type in the following analysis.
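Adjusted R-squared penalizes R-squared for the number of predictors, which makes it a fairer basis for comparing models with different numbers of terms. Below is a minimal sketch of the standard formula, applied to the figures printed in the summaries above. The function name `adj_r2` is ours, and n is recovered as the residual degrees of freedom plus the number of estimated coefficients.

```r
# Adjusted R-squared from R-squared, sample size n and number of predictors p
# (p excludes the intercept)
adj_r2 <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

n_obs <- 10871  # 10864 residual df + 7 coefficients in model 1

adj_model1  <- adj_r2(0.08094, n_obs, p = 6)  # model 1
adj_model22 <- adj_r2(0.1106,  n_obs, p = 5)  # model 2.2

print(round(c(model1 = adj_model1, model2.2 = adj_model22,
              difference = adj_model22 - adj_model1), 4))
```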
Our dataset has many more variables, so here are some ideas on how we can extend our analysis.
Q1. Are the number of bathrooms, bedrooms, beds, or the size of the house (accommodates) significant predictors of price_4_nights? Or might these be collinear variables?
But first, we need to convert bathrooms to a data type we can use.
log_listings_4_nights_2_people <- log_listings_4_nights_2_people %>%
  mutate(bathrooms_clean = parse_number(bathrooms_text))
correlation_matrix_data_2 <- log_listings_4_nights_2_people %>%
  select(price, bedrooms, bathrooms_clean, beds) #bathrooms is entirely NA, so we use the parsed bathrooms_clean
#Use pairwise complete observations because some of these columns contain missing values
corr <- round(cor(correlation_matrix_data_2, use = "pairwise.complete.obs"), 1)
ggcorrplot(corr)
log_listings_4_nights_2_people %>% #Correlation between size variables and price
  select(price, bathrooms_clean, bedrooms,beds, accommodates) %>%
  ggpairs(aes(alpha=0.2))+
  theme_bw()
model3 <- lm(price_4_nights ~ #Including bathrooms, beds, bedrooms and accommodates in the explanatory variables
number_of_reviews +
review_scores_rating +
room_type+
bathrooms_clean+
bedrooms+
beds+
accommodates,
data = log_listings_4_nights_2_people)
autoplot(model3)+ theme_bw()
get_regression_table(model3)
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 5.44 | 0.037 | 146 | 0 | 5.37 | 5.51 |
| number_of_reviews | -0.001 | 0 | -15.2 | 0 | -0.001 | -0.001 |
| review_scores_rating | -0.027 | 0.007 | -3.78 | 0 | -0.04 | -0.013 |
| room_type: Hotel room | 0.497 | 0.083 | 5.98 | 0 | 0.334 | 0.66 |
| room_type: Private room | -0.365 | 0.016 | -23.2 | 0 | -0.396 | -0.334 |
| room_type: Shared room | -0.919 | 0.073 | -12.5 | 0 | -1.06 | -0.775 |
| bathrooms_clean | 0.246 | 0.017 | 14.7 | 0 | 0.213 | 0.279 |
| bedrooms | 0.185 | 0.015 | 12.3 | 0 | 0.156 | 0.215 |
| beds | -0.02 | 0.007 | -2.81 | 0.005 | -0.034 | -0.006 |
| accommodates | 0.054 | 0.006 | 8.44 | 0 | 0.041 | 0.066 |
get_regression_summaries(model3)
| r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
|---|---|---|---|---|---|---|---|---|
| 0.246 | 0.245 | 0.308 | 0.555 | 0.555 | 360 | 0 | 9 | 9.97e+03 |
mosaic::msummary(model3)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.442e+00 3.734e-02 145.762 < 2e-16 ***
number_of_reviews -1.214e-03 8.014e-05 -15.151 < 2e-16 ***
review_scores_rating -2.659e-02 7.033e-03 -3.781 0.000157 ***
room_typeHotel room 4.972e-01 8.310e-02 5.983 2.27e-09 ***
room_typePrivate room -3.649e-01 1.570e-02 -23.247 < 2e-16 ***
room_typeShared room -9.194e-01 7.347e-02 -12.513 < 2e-16 ***
bathrooms_clean 2.463e-01 1.681e-02 14.658 < 2e-16 ***
bedrooms 1.853e-01 1.504e-02 12.321 < 2e-16 ***
beds -1.976e-02 7.030e-03 -2.810 0.004962 **
accommodates 5.384e-02 6.377e-03 8.442 < 2e-16 ***
Residual standard error: 0.5552 on 9961 degrees of freedom
(4090 observations deleted due to missingness)
Multiple R-squared: 0.2456, Adjusted R-squared: 0.2449
F-statistic: 360.3 on 9 and 9961 DF, p-value: < 2.2e-16
car::vif(model3)
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.011838 1 1.005901
review_scores_rating 1.010564 1 1.005268
room_type 1.189674 3 1.029370
bathrooms_clean 1.646086 1 1.282999
bedrooms 2.383258 1 1.543780
beds 2.276964 1 1.508961
accommodates 2.857174 1 1.690318
Comments: No GVIF value in this regression exceeds 5. However, the correlation analysis above does show high correlations among the four size variables (bedrooms, bathrooms, beds and accommodates), which is intuitively reasonable because larger properties have more of each. The negative coefficient on beds in model 3 (-0.020) is a typical symptom of this overlap. Therefore, to avoid overlapping predictors, we decided to keep only one of the four variables going forward.
Next, we determine which of bathrooms, bedrooms, beds and accommodates to keep.
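For terms with one degree of freedom, the GVIF reported by car::vif() is the ordinary variance inflation factor, 1 / (1 - R²), where R² comes from regressing that predictor on all the others. The sketch below illustrates the calculation in base R on simulated data (the data frame `sim` and the helper `vif_by_hand` are hypothetical, not part of our analysis), and then inverts the formula for the accommodates GVIF reported above.

```r
# Simulated size variables (hypothetical data, NOT the Milan listings).
set.seed(42)
n <- 500
bedrooms <- rpois(n, lambda = 1.5) + 1
beds <- bedrooms + rpois(n, lambda = 0.5)
accommodates <- 2 * bedrooms + rpois(n, lambda = 1)
sim <- data.frame(bedrooms, beds, accommodates)

# Hypothetical helper: regress one predictor on the others and
# convert the auxiliary R-squared into a VIF.
vif_by_hand <- function(data, term) {
  others <- setdiff(names(data), term)
  aux <- lm(reformulate(others, response = term), data = data)
  1 / (1 - summary(aux)$r.squared)
}

vifs <- sapply(names(sim), function(v) vif_by_hand(sim, v))
print(round(vifs, 2))

# Inverting the formula for the GVIF of accommodates in model3 (2.857174):
# the share of its variance explained by the other predictors.
implied_r2 <- 1 - 1 / 2.857174
print(round(implied_r2, 2))
```

Read this way, a GVIF of about 2.86 means that roughly 65% of the variation in accommodates is already explained by the other predictors in model 3, which is below the usual threshold of 5 but still substantial.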
model3.2 <- lm(price_4_nights ~ #keep bathrooms
number_of_reviews +
review_scores_rating +
bathrooms_clean+
room_type,
data = log_listings_4_nights_2_people)
mosaic::msummary(model3.2)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.5698793 0.0355384 156.729 < 2e-16 ***
number_of_reviews -0.0011674 0.0000771 -15.141 < 2e-16 ***
review_scores_rating -0.0301376 0.0068698 -4.387 1.16e-05 ***
bathrooms_clean 0.4735712 0.0132128 35.842 < 2e-16 ***
room_typeHotel room 0.4857933 0.0833597 5.828 5.78e-09 ***
room_typePrivate room -0.4471458 0.0143990 -31.054 < 2e-16 ***
room_typeShared room -1.0002917 0.0743096 -13.461 < 2e-16 ***
Residual standard error: 0.5635 on 10845 degrees of freedom
(3209 observations deleted due to missingness)
Multiple R-squared: 0.2038, Adjusted R-squared: 0.2033
F-statistic: 462.6 on 6 and 10845 DF, p-value: < 2.2e-16
car::vif(model3.2)
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.009105 1 1.004542
review_scores_rating 1.010265 1 1.005119
bathrooms_clean 1.002381 1 1.001190
room_type 1.006311 3 1.001049
model3.3 <- lm(price_4_nights ~ #keep bedrooms
number_of_reviews +
review_scores_rating +
bedrooms+
room_type,
data = log_listings_4_nights_2_people)
mosaic::msummary(model3.3)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.654e+00 3.593e-02 157.375 < 2e-16 ***
number_of_reviews -1.175e-03 8.117e-05 -14.476 < 2e-16 ***
review_scores_rating -2.729e-02 7.110e-03 -3.838 0.000125 ***
bedrooms 3.611e-01 1.009e-02 35.789 < 2e-16 ***
room_typeHotel room 4.406e-01 8.263e-02 5.331 9.95e-08 ***
room_typePrivate room -3.971e-01 1.487e-02 -26.710 < 2e-16 ***
room_typeShared room -9.434e-01 7.392e-02 -12.762 < 2e-16 ***
Residual standard error: 0.5645 on 10021 degrees of freedom
(4033 observations deleted due to missingness)
Multiple R-squared: 0.2213, Adjusted R-squared: 0.2208
F-statistic: 474.7 on 6 and 10021 DF, p-value: < 2.2e-16
car::vif(model3.3)
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.009046 1 1.004513
review_scores_rating 1.010345 1 1.005159
bedrooms 1.038187 1 1.018915
room_type 1.041589 3 1.006814
model3.4 <- lm(price_4_nights ~ #keep beds
number_of_reviews +
review_scores_rating +
beds+
room_type,
data = log_listings_4_nights_2_people)
mosaic::msummary(model3.4)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.876e+00 3.443e-02 170.650 < 2e-16 ***
number_of_reviews -1.219e-03 7.926e-05 -15.386 < 2e-16 ***
review_scores_rating -2.814e-02 7.085e-03 -3.973 7.16e-05 ***
beds 1.209e-01 4.862e-03 24.863 < 2e-16 ***
room_typeHotel room 4.244e-01 8.398e-02 5.054 4.40e-07 ***
room_typePrivate room -3.949e-01 1.513e-02 -26.104 < 2e-16 ***
room_typeShared room -9.966e-01 7.584e-02 -13.141 < 2e-16 ***
Residual standard error: 0.5799 on 10819 degrees of freedom
(3235 observations deleted due to missingness)
Multiple R-squared: 0.1581, Adjusted R-squared: 0.1576
F-statistic: 338.5 on 6 and 10819 DF, p-value: < 2.2e-16
car::vif(model3.4)
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.009965 1 1.004970
review_scores_rating 1.010254 1 1.005114
beds 1.042868 1 1.021209
room_type 1.045573 3 1.007455
model3.5 <- lm(price_4_nights ~ #keep accommodates
number_of_reviews +
review_scores_rating +
accommodates+
room_type,
data = log_listings_4_nights_2_people)
mosaic::msummary(model3.5)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.640e+00 3.516e-02 160.432 < 2e-16 ***
number_of_reviews -1.251e-03 7.747e-05 -16.145 < 2e-16 ***
review_scores_rating -2.720e-02 6.901e-03 -3.941 8.16e-05 ***
accommodates 1.344e-01 3.967e-03 33.892 < 2e-16 ***
room_typeHotel room 4.792e-01 8.212e-02 5.836 5.49e-09 ***
room_typePrivate room -2.999e-01 1.530e-02 -19.600 < 2e-16 ***
room_typeShared room -8.821e-01 7.426e-02 -11.878 < 2e-16 ***
Residual standard error: 0.5669 on 10864 degrees of freedom
(3190 observations deleted due to missingness)
Multiple R-squared: 0.1956, Adjusted R-squared: 0.1952
F-statistic: 440.3 on 6 and 10864 DF, p-value: < 2.2e-16
car::vif(model3.5)
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.010446 1 1.005209
review_scores_rating 1.010401 1 1.005187
accommodates 1.128111 1 1.062126
room_type 1.130313 3 1.020626
Comments: The adjusted R-squared values for models 3.2, 3.3, 3.4 and 3.5 are 0.2033, 0.2208, 0.1576 and 0.1952 respectively. Model 3.3 (bedrooms) has the highest value, so we keep bedrooms and exclude the other three variables.
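One caveat on this comparison: models 3.2 to 3.5 were fitted on different numbers of rows, because each size variable has its own missing values (the outputs above report 3209, 4033, 3235 and 3190 observations deleted). A like-for-like comparison would refit the candidates on the rows that are complete for all of them. The sketch below shows the idea in base R on simulated data; the data frame `d` and its columns are hypothetical stand-ins, not our listings.

```r
# Simulated data with missing values in two competing predictors
# (hypothetical stand-ins for e.g. bedrooms and beds).
set.seed(1)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 0.5 * d$x1 + 0.3 * d$x2 + rnorm(n)
d$x1[sample(n, 40)] <- NA
d$x2[sample(n, 15)] <- NA

# Keep only rows that are complete for every candidate predictor,
# so that each model is estimated on exactly the same sample.
common <- d[complete.cases(d), ]

fit_x1 <- lm(y ~ x1, data = common)
fit_x2 <- lm(y ~ x2, data = common)

# Same number of observations, so the adjusted R-squared values are comparable.
print(c(nobs(fit_x1), nobs(fit_x2)))
adj_r2 <- c(x1 = summary(fit_x1)$adj.r.squared,
            x2 = summary(fit_x2)$adj.r.squared)
print(round(adj_r2, 3))
```

We have not rerun our models this way, so the ranking above should be read with that limitation in mind.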
model4 <- lm(price_4_nights ~ #removing bathrooms, beds, and accommodates to correct for the effect of multi-collinearity among these variables
number_of_reviews +
review_scores_rating +
room_type+
bedrooms,
data = log_listings_4_nights_2_people)
autoplot(model4)+ theme_bw()
get_regression_table(model4)
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 5.65 | 0.036 | 157 | 0 | 5.58 | 5.72 |
| number_of_reviews | -0.001 | 0 | -14.5 | 0 | -0.001 | -0.001 |
| review_scores_rating | -0.027 | 0.007 | -3.84 | 0 | -0.041 | -0.013 |
| room_type: Hotel room | 0.441 | 0.083 | 5.33 | 0 | 0.279 | 0.603 |
| room_type: Private room | -0.397 | 0.015 | -26.7 | 0 | -0.426 | -0.368 |
| room_type: Shared room | -0.943 | 0.074 | -12.8 | 0 | -1.09 | -0.798 |
| bedrooms | 0.361 | 0.01 | 35.8 | 0 | 0.341 | 0.381 |
get_regression_summaries(model4)
| r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
|---|---|---|---|---|---|---|---|---|
| 0.221 | 0.221 | 0.318 | 0.564 | 0.565 | 475 | 0 | 6 | 1e+04 |
mosaic::msummary(model4)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.654e+00 3.593e-02 157.375 < 2e-16 ***
number_of_reviews -1.175e-03 8.117e-05 -14.476 < 2e-16 ***
review_scores_rating -2.729e-02 7.110e-03 -3.838 0.000125 ***
room_typeHotel room 4.406e-01 8.263e-02 5.331 9.95e-08 ***
room_typePrivate room -3.971e-01 1.487e-02 -26.710 < 2e-16 ***
room_typeShared room -9.434e-01 7.392e-02 -12.762 < 2e-16 ***
bedrooms 3.611e-01 1.009e-02 35.789 < 2e-16 ***
Residual standard error: 0.5645 on 10021 degrees of freedom
(4033 observations deleted due to missingness)
Multiple R-squared: 0.2213, Adjusted R-squared: 0.2208
F-statistic: 474.7 on 6 and 10021 DF, p-value: < 2.2e-16
car::vif(model4)
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.009046 1 1.004513
review_scores_rating 1.010345 1 1.005159
room_type 1.041589 3 1.006814
bedrooms 1.038187 1 1.018915
Q2. Do superhosts (host_is_superhost) command a pricing premium, after controlling for other variables?
model5 <- lm(price_4_nights ~ #adding host_is_superhost
number_of_reviews +
review_scores_rating +
room_type+
bedrooms+
host_is_superhost,
data = log_listings_4_nights_2_people)
autoplot(model5)+ theme_bw()
get_regression_table(model5)
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 5.64 | 0.036 | 157 | 0 | 5.57 | 5.71 |
| number_of_reviews | -0.001 | 0 | -13.2 | 0 | -0.001 | -0.001 |
| review_scores_rating | -0.023 | 0.007 | -3.23 | 0.001 | -0.037 | -0.009 |
| room_type: Hotel room | 0.449 | 0.083 | 5.43 | 0 | 0.287 | 0.611 |
| room_type: Private room | -0.399 | 0.015 | -26.8 | 0 | -0.428 | -0.37 |
| room_type: Shared room | -0.949 | 0.074 | -12.8 | 0 | -1.09 | -0.804 |
| bedrooms | 0.361 | 0.01 | 35.9 | 0 | 0.342 | 0.381 |
| host_is_superhostTRUE | -0.054 | 0.014 | -3.81 | 0 | -0.082 | -0.026 |
get_regression_summaries(model5)
| r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
|---|---|---|---|---|---|---|---|---|
| 0.222 | 0.222 | 0.318 | 0.564 | 0.564 | 410 | 0 | 7 | 1e+04 |
mosaic::msummary(model5)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.6441760 0.0359902 156.826 < 2e-16 ***
number_of_reviews -0.0010985 0.0000835 -13.156 < 2e-16 ***
review_scores_rating -0.0232280 0.0071872 -3.232 0.001234 **
room_typeHotel room 0.4487310 0.0825965 5.433 5.68e-08 ***
room_typePrivate room -0.3987332 0.0148617 -26.830 < 2e-16 ***
room_typeShared room -0.9492805 0.0738752 -12.850 < 2e-16 ***
bedrooms 0.3614135 0.0100810 35.851 < 2e-16 ***
host_is_superhostTRUE -0.0544075 0.0142720 -3.812 0.000139 ***
Residual standard error: 0.5641 on 10019 degrees of freedom
(4034 observations deleted due to missingness)
Multiple R-squared: 0.2225, Adjusted R-squared: 0.2219
F-statistic: 409.6 on 7 and 10019 DF, p-value: < 2.2e-16
car::vif(model5)
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.069533 1 1.034182
review_scores_rating 1.033888 1 1.016803
room_type 1.043856 3 1.007179
bedrooms 1.038241 1 1.018941
host_is_superhost 1.093506 1 1.045708
Comments: After adding host_is_superhost, we observe that it is a statistically significant predictor of price (the absolute value of its t-statistic, 3.81, is greater than 2), but its coefficient is small and negative (-0.054). Superhost status is awarded for the quality of service provided, and guests generally weigh “value for money” heavily when reviewing a host. We therefore think superhosts may charge relatively lower prices for the same level of quality, which could explain the negative relationship.
We also see that the adjusted R-squared has increased slightly, from 0.2208 to 0.2219, after adding superhost status, so the variable explains a small additional share of the variation in prices in Milan.
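To put the coefficient in price terms: assuming, as the data frame name and the exp() back-transform in our prediction step suggest, that price_4_nights is on the log scale, a dummy coefficient β corresponds to a percentage change in price of 100 × (exp(β) − 1). A minimal sketch in base R, using the estimate and confidence bounds from the model5 output above; the helper name `pct_effect` is ours and purely illustrative.

```r
# Hypothetical helper: convert a log-scale coefficient into a percentage change.
pct_effect <- function(beta) 100 * (exp(beta) - 1)

# Estimate and 95% confidence bounds for host_is_superhostTRUE,
# copied from the model5 output (not recomputed here).
superhost <- c(estimate = -0.0544075, lower_ci = -0.082, upper_ci = -0.026)

superhost_pct <- pct_effect(superhost)
print(round(superhost_pct, 1))
```

Under that assumption, superhosts price about 5% lower than comparable hosts, with the interval running from roughly 8% lower to roughly 3% lower.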
Q3. Some hosts allow you to immediately book their listing (instant_bookable == TRUE), while a non-trivial proportion don’t. After controlling for other variables, is instant_bookable a significant predictor of price_4_nights?
model6 <- lm(price_4_nights ~ #adding instant_bookable
number_of_reviews +
review_scores_rating +
room_type+
bedrooms+
host_is_superhost+
instant_bookable,
data = log_listings_4_nights_2_people)
autoplot(model6)+ theme_bw()
get_regression_table(model6)
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 5.59 | 0.037 | 152 | 0 | 5.51 | 5.66 |
| number_of_reviews | -0.001 | 0 | -13.8 | 0 | -0.001 | -0.001 |
| review_scores_rating | -0.019 | 0.007 | -2.65 | 0.008 | -0.033 | -0.005 |
| room_type: Hotel room | 0.406 | 0.083 | 4.91 | 0 | 0.244 | 0.567 |
| room_type: Private room | -0.384 | 0.015 | -25.6 | 0 | -0.413 | -0.354 |
| room_type: Shared room | -0.937 | 0.074 | -12.7 | 0 | -1.08 | -0.792 |
| bedrooms | 0.362 | 0.01 | 36.1 | 0 | 0.343 | 0.382 |
| host_is_superhostTRUE | -0.061 | 0.014 | -4.25 | 0 | -0.089 | -0.033 |
| instant_bookableTRUE | 0.089 | 0.012 | 7.67 | 0 | 0.066 | 0.111 |
get_regression_summaries(model6)
| r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
|---|---|---|---|---|---|---|---|---|
| 0.227 | 0.226 | 0.316 | 0.562 | 0.562 | 368 | 0 | 8 | 1e+04 |
mosaic::msummary(model6)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.586e+00 3.669e-02 152.258 < 2e-16 ***
number_of_reviews -1.151e-03 8.354e-05 -13.774 < 2e-16 ***
review_scores_rating -1.904e-02 7.187e-03 -2.649 0.00809 **
room_typeHotel room 4.055e-01 8.255e-02 4.912 9.15e-07 ***
room_typePrivate room -3.835e-01 1.495e-02 -25.649 < 2e-16 ***
room_typeShared room -9.368e-01 7.368e-02 -12.715 < 2e-16 ***
bedrooms 3.625e-01 1.005e-02 36.057 < 2e-16 ***
host_is_superhostTRUE -6.057e-02 1.425e-02 -4.249 2.17e-05 ***
instant_bookableTRUE 8.872e-02 1.157e-02 7.668 1.90e-14 ***
Residual standard error: 0.5625 on 10018 degrees of freedom
(4034 observations deleted due to missingness)
Multiple R-squared: 0.227, Adjusted R-squared: 0.2264
F-statistic: 367.8 on 8 and 10018 DF, p-value: < 2.2e-16
car::vif(model6)
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.076657 1 1.037621
review_scores_rating 1.039901 1 1.019755
room_type 1.068460 3 1.011097
bedrooms 1.038440 1 1.019039
host_is_superhost 1.096988 1 1.047372
instant_bookable 1.041074 1 1.020330
Comments: After adding instant_bookable, the adjusted R-squared increased further, from 0.2219 to 0.2264. The new model therefore explains slightly more of the variation in prices.
The output shows that instant_bookable is statistically significant, with a positive coefficient of 0.089, so instant-bookable listings tend to be priced higher. We see two plausible reasons. Firstly, instant booking gives customers more flexibility and saves the time spent waiting for approval; customers with urgent needs tend to have a higher willingness to pay, which supports higher prices for these rooms on Airbnb. Secondly, the feature requires high response rates and very flexible arrangements from the host whenever an instant booking arrives, which drives up operating costs and, in turn, prices.
We conclude that we should keep this variable in the regression model for the rest of the analysis.
Q4. Is neighbourhood_simplified a predictor of price_4_nights?
model7 <- lm(price_4_nights ~ #Adding neighbourhood_simplified
number_of_reviews +
review_scores_rating +
room_type+
bedrooms+
host_is_superhost+
instant_bookable+
neighbourhood_simplified,
data = log_listings_4_nights_2_people)
autoplot(model7)+ theme_bw()
get_regression_table(model7)
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 5.62 | 0.037 | 151 | 0 | 5.55 | 5.7 |
| number_of_reviews | -0.001 | 0 | -14.2 | 0 | -0.001 | -0.001 |
| review_scores_rating | -0.019 | 0.007 | -2.59 | 0.01 | -0.033 | -0.005 |
| room_type: Hotel room | 0.385 | 0.082 | 4.67 | 0 | 0.223 | 0.546 |
| room_type: Private room | -0.379 | 0.015 | -25.4 | 0 | -0.408 | -0.35 |
| room_type: Shared room | -0.943 | 0.073 | -12.8 | 0 | -1.09 | -0.799 |
| bedrooms | 0.364 | 0.01 | 36.3 | 0 | 0.344 | 0.383 |
| host_is_superhostTRUE | -0.063 | 0.014 | -4.42 | 0 | -0.091 | -0.035 |
| instant_bookableTRUE | 0.086 | 0.012 | 7.48 | 0 | 0.064 | 0.109 |
| neighbourhood_simplified: Northwest | -0.138 | 0.016 | -8.58 | 0 | -0.169 | -0.106 |
| neighbourhood_simplified: Southeast | -0.043 | 0.014 | -3.1 | 0.002 | -0.071 | -0.016 |
| neighbourhood_simplified: Southwest | -0.048 | 0.018 | -2.68 | 0.007 | -0.084 | -0.013 |
get_regression_summaries(model7)
| r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
|---|---|---|---|---|---|---|---|---|
| 0.233 | 0.232 | 0.314 | 0.56 | 0.56 | 276 | 0 | 11 | 1e+04 |
mosaic::msummary(model7)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.624e+00 3.715e-02 151.379 < 2e-16 ***
number_of_reviews -1.183e-03 8.339e-05 -14.183 < 2e-16 ***
review_scores_rating -1.856e-02 7.164e-03 -2.591 0.00959 **
room_typeHotel room 3.846e-01 8.232e-02 4.673 3.01e-06 ***
room_typePrivate room -3.790e-01 1.492e-02 -25.406 < 2e-16 ***
room_typeShared room -9.425e-01 7.342e-02 -12.837 < 2e-16 ***
bedrooms 3.637e-01 1.002e-02 36.288 < 2e-16 ***
host_is_superhostTRUE -6.279e-02 1.421e-02 -4.420 9.96e-06 ***
instant_bookableTRUE 8.645e-02 1.156e-02 7.479 8.11e-14 ***
neighbourhood_simplifiedNorthwest -1.377e-01 1.605e-02 -8.579 < 2e-16 ***
neighbourhood_simplifiedSoutheast -4.325e-02 1.395e-02 -3.099 0.00194 **
neighbourhood_simplifiedSouthwest -4.835e-02 1.801e-02 -2.684 0.00728 **
Residual standard error: 0.5605 on 10015 degrees of freedom
(4034 observations deleted due to missingness)
Multiple R-squared: 0.2327, Adjusted R-squared: 0.2319
F-statistic: 276.1 on 11 and 10015 DF, p-value: < 2.2e-16
car::vif(model7)
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.080520 1 1.039481
review_scores_rating 1.040350 1 1.019976
room_type 1.072696 3 1.011765
bedrooms 1.039201 1 1.019412
host_is_superhost 1.097400 1 1.047569
instant_bookable 1.046336 1 1.022906
neighbourhood_simplified 1.016307 3 1.002700
Comments: After running model 7, we found that neighbourhood also has explanatory power in predicting price: all three neighbourhood coefficients are significant at the 5% significance level. More specifically, Airbnbs located in the Northwest, Southeast and Southwest tend to have lower prices than those in the Northeast, which is the baseline region.
Q5. What is the effect of availability_30 or reviews_per_month on price_4_nights, after we control for other variables?
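The individual t-tests above compare each region with the Northeast baseline. Whether neighbourhood_simplified matters as a whole is a joint question, which a nested-model F-test answers. Models 6 and 7 are fitted on the same 10,027 observations, so the statistic can be approximated from the two reported R-squared values. The sketch below does this in base R; the inputs are the rounded figures printed above, so the result is approximate, and with the fitted objects available `anova(model6, model7)` would be the direct route.

```r
# Reported values (rounded) from the msummary outputs of model6 and model7.
r2_reduced <- 0.2270   # model6: without neighbourhood_simplified
r2_full    <- 0.2327   # model7: with neighbourhood_simplified
q          <- 3        # extra coefficients: Northwest, Southeast, Southwest
df_resid   <- 10015    # residual degrees of freedom of model7

# Nested-model F statistic expressed through R-squared.
f_stat <- ((r2_full - r2_reduced) / q) / ((1 - r2_full) / df_resid)

# Upper-tail p-value from the F distribution.
p_value <- pf(f_stat, df1 = q, df2 = df_resid, lower.tail = FALSE)

print(round(f_stat, 1))
print(p_value)
```

An F statistic of this size on 3 and 10,015 degrees of freedom supports keeping the neighbourhood grouping as a block, in line with the individual coefficients.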
model8 <- lm(price_4_nights ~ #Adding availability_30
number_of_reviews +
review_scores_rating +
room_type+
bedrooms+
host_is_superhost+
instant_bookable+
neighbourhood_simplified+
availability_30,
data = log_listings_4_nights_2_people)
autoplot(model8)+ theme_bw()
get_regression_table(model8)
| term | estimate | std_error | statistic | p_value | lower_ci | upper_ci |
|---|---|---|---|---|---|---|
| intercept | 5.47 | 0.036 | 153 | 0 | 5.4 | 5.54 |
| number_of_reviews | -0.001 | 0 | -13.8 | 0 | -0.001 | -0.001 |
| review_scores_rating | -0.017 | 0.007 | -2.54 | 0.011 | -0.031 | -0.004 |
| room_type: Hotel room | 0.237 | 0.079 | 3.01 | 0.003 | 0.083 | 0.391 |
| room_type: Private room | -0.404 | 0.014 | -28.4 | 0 | -0.432 | -0.376 |
| room_type: Shared room | -0.964 | 0.07 | -13.8 | 0 | -1.1 | -0.827 |
| bedrooms | 0.362 | 0.01 | 38 | 0 | 0.344 | 0.381 |
| host_is_superhostTRUE | -0.045 | 0.014 | -3.35 | 0.001 | -0.072 | -0.019 |
| instant_bookableTRUE | 0.111 | 0.011 | 10.1 | 0 | 0.089 | 0.133 |
| neighbourhood_simplified: Northwest | -0.132 | 0.015 | -8.64 | 0 | -0.162 | -0.102 |
| neighbourhood_simplified: Southeast | -0.032 | 0.013 | -2.43 | 0.015 | -0.058 | -0.006 |
| neighbourhood_simplified: Southwest | -0.046 | 0.017 | -2.71 | 0.007 | -0.08 | -0.013 |
| availability_30 | 0.017 | 0.001 | 32 | 0 | 0.016 | 0.018 |
get_regression_summaries(model8)
| r_squared | adj_r_squared | mse | rmse | sigma | statistic | p_value | df | nobs |
|---|---|---|---|---|---|---|---|---|
| 0.304 | 0.303 | 0.285 | 0.534 | 0.534 | 364 | 0 | 12 | 1e+04 |
mosaic::msummary(model8)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.469e+00 3.572e-02 153.118 < 2e-16 ***
number_of_reviews -1.095e-03 7.949e-05 -13.776 < 2e-16 ***
review_scores_rating -1.731e-02 6.825e-03 -2.536 0.011234 *
room_typeHotel room 2.366e-01 7.856e-02 3.012 0.002606 **
room_typePrivate room -4.039e-01 1.423e-02 -28.383 < 2e-16 ***
room_typeShared room -9.642e-01 6.995e-02 -13.784 < 2e-16 ***
bedrooms 3.624e-01 9.547e-03 37.958 < 2e-16 ***
host_is_superhostTRUE -4.532e-02 1.354e-02 -3.346 0.000823 ***
instant_bookableTRUE 1.110e-01 1.104e-02 10.053 < 2e-16 ***
neighbourhood_simplifiedNorthwest -1.321e-01 1.530e-02 -8.636 < 2e-16 ***
neighbourhood_simplifiedSoutheast -3.236e-02 1.330e-02 -2.433 0.014974 *
neighbourhood_simplifiedSouthwest -4.649e-02 1.716e-02 -2.709 0.006759 **
availability_30 1.671e-02 5.229e-04 31.951 < 2e-16 ***
Residual standard error: 0.534 on 10014 degrees of freedom
(4034 observations deleted due to missingness)
Multiple R-squared: 0.3037, Adjusted R-squared: 0.3029
F-statistic: 364 on 12 and 10014 DF, p-value: < 2.2e-16
car::vif(model8)
GVIF Df GVIF^(1/(2*Df))
number_of_reviews 1.081808 1 1.040100
review_scores_rating 1.040385 1 1.019993
room_type 1.079578 3 1.012843
bedrooms 1.039219 1 1.019421
host_is_superhost 1.099192 1 1.048423
instant_bookable 1.051417 1 1.025386
neighbourhood_simplified 1.017011 3 1.002815
availability_30 1.017988 1 1.008954
Comments: Following the addition of availability_30, the R-squared increased from 0.233 to 0.304. This is a sizeable increase and suggests that model 8 explains more of the variation in prices than model 7. Additionally, the t-statistic of 31.95 is a very strong indication that this is a significant variable. A likely reason is that cheaper properties tend to be booked first, leaving the more expensive properties available on the site, which would explain the positive coefficient.
Due to this strong significance, we will keep availability_30 in the model.
Additional Factors That Might Improve the Model: Apart from the variables in the data frame, other factors that might help explain the price include the distance to the Duomo di Milano: the closer a listing is to the cathedral, the more expensive we would expect it to be, since a central location makes it more convenient for visitors to travel around Milan. This should not lead to collinearity, since our neighbourhood grouping does not capture distance to central Milan. In addition, season would likely have some explanatory power, since visitor numbers vary across seasons, which affects the demand for Airbnbs and hence the price.
Summary table, built with huxtable (https://mfa2022.netlify.app/example/modelling_side_by_side_tables/), showing which models we worked on, which predictors are significant, the adjusted \(R^2\), and the residual standard error.
huxreg(model1,model2,model3,model4,model5,model6,model7,model8,
statistics = c('#observations' = 'nobs',
'R squared' = 'r.squared',
'Adj. R Squared' = 'adj.r.squared',
'Residual SE' = 'sigma'),
bold_signif = 0.05
)
| | (1) | (2) | (3) | (4) | (5) | (6) | (7) | (8) |
|---|---|---|---|---|---|---|---|---|
| (Intercept) | 6.013 *** | 6.016 *** | 5.442 *** | 5.654 *** | 5.644 *** | 5.586 *** | 5.624 *** | 5.469 *** |
| | (0.039) | (0.039) | (0.037) | (0.036) | (0.036) | (0.037) | (0.037) | (0.036) |
| prop_type_simplifiedEntire loft | 0.173 *** | 0.172 *** | | | | | | |
| | (0.032) | (0.032) | | | | | | |
| prop_type_simplifiedEntire rental unit | 0.094 *** | 0.093 *** | | | | | | |
| | (0.021) | (0.021) | | | | | | |
| prop_type_simplifiedOther | -0.093 *** | 0.327 *** | | | | | | |
| | (0.027) | (0.036) | | | | | | |
| prop_type_simplifiedPrivate room in rental unit | -0.389 *** | 0.274 *** | | | | | | |
| | (0.026) | (0.047) | | | | | | |
| number_of_reviews | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** | -0.001 *** |
| | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) | (0.000) |
| review_scores_rating | -0.029 *** | -0.029 *** | -0.027 *** | -0.027 *** | -0.023 ** | -0.019 ** | -0.019 ** | -0.017 * |
| | (0.007) | (0.007) | (0.007) | (0.007) | (0.007) | (0.007) | (0.007) | (0.007) |
| room_typeHotel room | | 0.194 * | 0.497 *** | 0.441 *** | 0.449 *** | 0.406 *** | 0.385 *** | 0.237 ** |
| | | (0.091) | (0.083) | (0.083) | (0.083) | (0.083) | (0.082) | (0.079) |
| room_typePrivate room | | -0.664 *** | -0.365 *** | -0.397 *** | -0.399 *** | -0.384 *** | -0.379 *** | -0.404 *** |
| | | (0.039) | (0.016) | (0.015) | (0.015) | (0.015) | (0.015) | (0.014) |
| room_typeShared room | | -1.262 *** | -0.919 *** | -0.943 *** | -0.949 *** | -0.937 *** | -0.943 *** | -0.964 *** |
| | | (0.083) | (0.073) | (0.074) | (0.074) | (0.074) | (0.073) | (0.070) |
| bathrooms_clean | | | 0.246 *** | | | | | |
| | | | (0.017) | | | | | |
| bedrooms | | | 0.185 *** | 0.361 *** | 0.361 *** | 0.362 *** | 0.364 *** | 0.362 *** |
| | | | (0.015) | (0.010) | (0.010) | (0.010) | (0.010) | (0.010) |
| beds | | | -0.020 ** | | | | | |
| | | | (0.007) | | | | | |
| accommodates | | | 0.054 *** | | | | | |
| | | | (0.006) | | | | | |
| host_is_superhostTRUE | | | | | -0.054 *** | -0.061 *** | -0.063 *** | -0.045 *** |
| | | | | | (0.014) | (0.014) | (0.014) | (0.014) |
| instant_bookableTRUE | | | | | | 0.089 *** | 0.086 *** | 0.111 *** |
| | | | | | | (0.012) | (0.012) | (0.011) |
| neighbourhood_simplifiedNorthwest | | | | | | | -0.138 *** | -0.132 *** |
| | | | | | | | (0.016) | (0.015) |
| neighbourhood_simplifiedSoutheast | | | | | | | -0.043 ** | -0.032 * |
| | | | | | | | (0.014) | (0.013) |
| neighbourhood_simplifiedSouthwest | | | | | | | -0.048 ** | -0.046 ** |
| | | | | | | | (0.018) | (0.017) |
| availability_30 | | | | | | | | 0.017 *** |
| | | | | | | | | (0.001) |
| #observations | 10871 | 10871 | 9971 | 10028 | 10027 | 10027 | 10027 | 10027 |
| R squared | 0.081 | 0.118 | 0.246 | 0.221 | 0.222 | 0.227 | 0.233 | 0.304 |
| Adj. R Squared | 0.080 | 0.118 | 0.245 | 0.221 | 0.222 | 0.226 | 0.232 | 0.303 |
| Residual SE | 0.606 | 0.594 | 0.555 | 0.565 | 0.564 | 0.562 | 0.560 | 0.534 |

Significance codes: *** p < 0.001; ** p < 0.01; * p < 0.05.
# Listings that meet our criteria: at least 10 reviews, rating of 4.5 or more, private room
filtered_dataset <- listings %>%
  filter(number_of_reviews >= 10, review_scores_rating >= 4.5, room_type == "Private room")
model_prediction <-
data.frame(predict(model8, newdata = filtered_dataset, interval = "prediction")) %>%
mutate(price = exp(fit),
CI_lower = exp(lwr),
CI_upper = exp(upr)) %>%
select(-fit, -lwr, -upr)
head(model_prediction)
| price | CI_lower | CI_upper |
|---|---|---|
| 606 | 212 | 1.73e+03 |
| 215 | 75.5 | 613 |
| 146 | 51.1 | 417 |
| 196 | 68.9 | 560 |
| 207 | 72.7 | 591 |
| 193 | 67.8 | 550 |
ggplot(model_prediction, aes(x = price)) +
geom_density()+
labs(title="Price Distribution of Suitable Airbnb", x="Pricing") +
theme(axis.text.y = element_blank())
In the final data frame we can observe the predicted price and the 95% prediction interval for each listing (the columns CI_lower and CI_upper). The predicted prices were calculated using model 8, which has an R-squared of 0.304. This low R-squared is responsible for the wide intervals.
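The width of these intervals can be checked with a rough calculation. On the log scale a 95% prediction interval is approximately the fitted value plus or minus 1.96 residual standard errors, so after exp() the bounds are the point prediction divided and multiplied by a fixed factor. The sketch below uses the residual standard error reported for model 8 (0.534) and the first predicted price in the table above (606). It is an approximation under a normal assumption that ignores the uncertainty in the coefficients, not a replacement for predict().

```r
# Residual standard error of model8, as reported in its summary output.
sigma_hat <- 0.534

# Multiplicative half-width of an approximate 95% prediction interval
# after back-transforming from the log scale with exp().
half_width_factor <- exp(qnorm(0.975) * sigma_hat)

# First predicted price from the head(model_prediction) table.
point <- 606
approx_interval <- c(lower = point / half_width_factor,
                     upper = point * half_width_factor)

print(round(half_width_factor, 2))
print(round(approx_interval))
```

The factor comes out at about 2.85, which matches the reported bounds for that listing (212 and roughly 1,730) closely: each upper bound is around eight times the corresponding lower bound.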